TakeHomeEx3

Author

Lin Lin

Objective definition:

FishEye International, a non-profit focused on countering illegal, unreported, and unregulated (IUU) fishing, has been given access to an international finance corporation’s database on fishing-related companies. In the past, FishEye has determined that companies with anomalous structures are far more likely to be involved in IUU (or other “fishy” business). FishEye has transformed the database into a knowledge graph. It includes information about companies, owners, workers, and financial status. FishEye is aiming to use this graph to identify anomalies that could indicate a company is involved in IUU.

FishEye analysts have attempted to use traditional node-link visualizations and standard graph analyses, but these were found to be ineffective because the scale and detail in the data can obscure a business’s true structure. The research below aims to help FishEye develop a new visual analytics approach to better understand fishing business anomalies.

We will use visual analytics to understand patterns of groups in the knowledge graph and highlight anomalous groups.

Task 1: Use visual analytics to identify anomalies in the business groups present in the knowledge graph.

Task 2: Develop a visual analytics process to find similar businesses and group them. This analysis should focus on a business’s most important features and present those features clearly to the user.

  #The grouping will be done visually
  #we need to find groups and business groups, we have to do a bit of text sensing

Data Pre-processing and cleaning

Load the libraries and read the JSON relationship file MC3.

  #| echo: false
  #tidytext -- text mining library with R: https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html
  #Load Libraries   
  pacman::p_load(jsonlite,tidygraph, ggraph, visNetwork, tidyverse, shiny, plotly, graphlayouts, ggforce, tidytext,skimr)   
  #load Data   
  MC3<- fromJSON("data/MC3.json")

Pick desired fields in MC3

We pick the desired fields and reorder the columns with the select() function. Each node in MC3 is a company or a person, and carries a description of the company's products and services, its country, and the revenue it generated.

On loading the data, we found that the graph is undirected, so the in/out direction of a connection is not known.

The code below extracts the nodes and edges for further processing.

  #glimpse(MC3)
  MC3_nodes <- as_tibble(MC3$nodes)
  colSums(is.na(MC3_nodes))
         country               id product_services      revenue_omu 
               0                0                0                0 
            type 
               0 
  #extract and mutate the format so it's not list but dataframe
  MC3_nodes_clean <- MC3_nodes %>% mutate(country = as.character(country),
                                          id = as.character(id),
                                          product_services = as.character(product_services),
                                          revenue_omu = as.numeric(as.character(revenue_omu)),    #we need to convert to numeric directly
                                          type = as.character(type)) %>%
    select(id, country, type, revenue_omu, product_services)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `revenue_omu = as.numeric(as.character(revenue_omu))`.
Caused by warning:
! NAs introduced by coercion

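The original data have no NA values. However, converting revenue_omu to numeric introduces NA wherever a value cannot be parsed as a number (see the coercion warning above), so some rows now have a missing revenue, as the check below shows.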
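The effect can be reproduced on toy data. The sketch below assumes that revenue_omu arrives as a list column in which some entries are empty; the values are made up for illustration and are not taken from MC3.

```r
# Toy list column: two revenues and one empty entry (hypothetical values)
revenue_raw <- list(1500.5, character(0), 320)

# as.character() on a list deparses each element, so the empty entry
# becomes the literal string "character(0)"
revenue_chr <- as.character(revenue_raw)

# as.numeric() cannot parse that string and returns NA (with a warning,
# suppressed here to keep the output short)
revenue_num <- suppressWarnings(as.numeric(revenue_chr))

print(revenue_chr)
print(revenue_num)
print(sum(is.na(revenue_num)))  # number of values lost to coercion
```

Counting the NA values before and after the conversion, as done with colSums() in this section, is a cheap way to see how many records are affected.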

  #check data quality
  colSums(is.na(MC3_nodes_clean))
              id          country             type      revenue_omu 
               0                0                0            21515 
product_services 
               0 
  #check which are the types?
  unique(MC3_nodes_clean$type)
[1] "Company"          "Company Contacts" "Beneficial Owner"
  skim(MC3_nodes_clean)
Data summary
Name MC3_nodes_clean
Number of rows 27622
Number of columns 5
_______________________
Column type frequency:
character 4
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
id 0 1 6 64 0 22929 0
country 0 1 2 15 0 100 0
type 0 1 7 16 0 3 0
product_services 0 1 4 1737 0 3244 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
revenue_omu 21515 0.22 1822155 18184433 3652.23 7676.36 16210.68 48327.66 310612303 ▇▁▁▁▁
  DT::datatable(MC3_nodes_clean)
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html
  MC3_edges <- as_tibble(MC3$links) %>% 
  distinct() %>%
  mutate(source = as.character(source),
         target = as.character(target),
         type = as.character(type)) %>%
  group_by(source, target, type) %>%
    summarise(weights = n()) %>%
  filter(source!=target) %>%
  ungroup()
`summarise()` has grouped output by 'source', 'target'. You can override using
the `.groups` argument.
  #check missing value
  colSums(is.na(MC3_edges))
 source  target    type weights 
      0       0       0       0 

There are no missing values in the edges data. Next, we explore the dataset.
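To make the weight calculation concrete, the sketch below applies the same group_by() and summarise() pattern to a small, made-up edge list. The helper name count_weights is hypothetical. The sketch also illustrates that when distinct() runs first, every weight comes out as 1, so the order of the two steps matters if repeated links are meant to be counted.

```r
library(dplyr)

# Made-up edges: A-X appears twice, C-C is a self-loop
toy_edges <- tibble::tibble(
  source = c("A", "A", "B", "C"),
  target = c("X", "X", "Y", "C"),
  type   = c("Beneficial Owner", "Beneficial Owner",
             "Company Contacts", "Company Contacts")
)

# Hypothetical helper: count rows per (source, target, type), drop self-loops
count_weights <- function(edges) {
  edges %>%
    group_by(source, target, type) %>%
    summarise(weights = n(), .groups = "drop") %>%
    filter(source != target)
}

weights_raw   <- count_weights(toy_edges)            # duplicates are counted
weights_dedup <- count_weights(distinct(toy_edges))  # duplicates removed first

print(weights_raw)
print(weights_dedup)
```

Passing .groups = "drop" also avoids the grouping message that summarise() prints and makes a trailing ungroup() unnecessary.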

  DT::datatable(MC3_edges)
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html
  #check which are the types?
  unique(MC3_edges$type)
[1] "Company Contacts" "Beneficial Owner"
  #datatable() of DT package is used to display mc3_edges tibble data frame as an interactive table on the html document.

To find business groups, we first check which relationship types are present in the edge data. There might be owner-business, customer-business, and business-business relationships.

  ggplot(data = MC3_edges,
       aes(x = type)) +
  geom_bar()

  MC3_edges_clean <- MC3_edges %>% mutate(source = as.character(source),
                       target = as.character(target),
                       type = as.character(type)) %>%
  group_by(source, target, type) %>%
  summarise(weights = n()) %>%
  filter(source!=target) %>%
  ungroup()
`summarise()` has grouped output by 'source', 'target'. You can override using
the `.groups` argument.

Handling missing values: some product_services entries hold the placeholder string “character(0)” instead of a description. We recode these values to NA before passing them to text sensing.

  # Recode "character(0)" to NA in the product_services column
  MC3_nodes_clean$product_services[MC3_nodes_clean$product_services == "character(0)"] <- NA
  
  ggplot(data = MC3_nodes_clean,
         aes(x = type)) +
    geom_bar()
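The same recode can also be written inside a pipeline with dplyr::na_if(). The sketch below is one possible alternative on made-up rows, not the MC3 data.

```r
library(dplyr)

# Made-up nodes; the second row carries the placeholder string
toy_nodes <- tibble::tibble(
  id = c("n1", "n2", "n3"),
  product_services = c("Fish and seafood", "character(0)", "Unknown")
)

# na_if() replaces values equal to the given string with NA
toy_nodes_na <- toy_nodes %>%
  mutate(product_services = na_if(product_services, "character(0)"))

print(colSums(is.na(toy_nodes_na)))
```

Only exact matches are replaced, so other placeholder-like values (such as "Unknown" in the toy data) stay as they are and would need their own rule.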

Building network model with tidygraph

  id1 <- MC3_edges_clean %>%
  select(source) %>%
  rename(id = source)
  id2 <- MC3_edges_clean %>%
    select(target) %>%
    rename(id = target)
  MC3_nodes1 <- rbind(id1, id2) %>%
    distinct() %>%
    left_join(MC3_nodes_clean,
              unmatched = "drop")
Joining with `by = join_by(id)`
  mc3_graph <- tbl_graph(nodes = MC3_nodes1,
                       edges = MC3_edges_clean,
                       directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness())
  mc3_graph %>%
  filter(betweenness_centrality >= 100000) %>%
  ggraph(layout = "fr") +
    #constant colour and alpha are set outside aes() so they are not treated as data mappings
    geom_edge_link(alpha = 0.5) +
    geom_node_point(aes(size = betweenness_centrality),
                    colour = "lightblue",
                    alpha = 0.5) +
    scale_size_continuous(range = c(1, 10)) +
    theme_graph()
Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family
not found in Windows font database

Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
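For readers unfamiliar with the centrality step in the network model above, the following sketch builds a four-node toy graph and computes betweenness centrality the same way. The node and edge values are made up, and the column names name, from and to are chosen for the toy example only.

```r
library(tidygraph)
library(dplyr)

# Star-shaped toy graph: b is connected to a, c and d
toy_nodes <- tibble::tibble(name = c("a", "b", "c", "d"))
toy_edges <- tibble::tibble(from = c("a", "b", "b"),
                            to   = c("b", "c", "d"))

toy_graph <- tbl_graph(nodes = toy_nodes,
                       edges = toy_edges,
                       directed = FALSE) %>%
  mutate(betweenness = centrality_betweenness())

# Node table with the computed scores
scores <- toy_graph %>%
  activate(nodes) %>%
  as_tibble()

# Keep only nodes that sit on at least one shortest path between others
hubs <- toy_graph %>%
  filter(betweenness >= 1) %>%
  activate(nodes) %>%
  as_tibble()

print(scores)
print(hubs)
```

Every shortest path between a, c and d passes through b, so b is the only node with a non-zero score. Filtering on a threshold, as in the plot above, keeps such connector nodes and hides the rest.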


Text sensing with tidytext

Word count

  #start a bit of text sensing, display the result by max value first

  MC3_nodes_clean <- MC3_nodes_clean %>% 
  mutate(n_fish = str_count(product_services, "fish")) %>%
  arrange(desc(n_fish))

  library(ggplot2)

  # MC3_nodes_clean <- MC3_nodes_clean %>%
  #   group_by(type) %>%
  #   summarize(n_fish = sum(str_count(product_services, "fish")), .groups = "drop")
  
  ggplot(data = MC3_nodes_clean, aes(x = type, y = n_fish)) +
  geom_bar(stat = "identity")
Warning: Removed 18959 rows containing missing values (`position_stack()`).
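Two properties of str_count() are worth keeping in mind when reading the chart: a plain string pattern is matched case-sensitively, and an NA input gives an NA count, which is likely why rows without a product description are dropped from the plot. The sketch below shows both on made-up descriptions.

```r
library(stringr)

# Made-up descriptions, including one missing value
descriptions <- c("Fish and fish products", "Canned FISH", "Logistics", NA)

# Default matching is case-sensitive: "Fish" and "FISH" are not counted
n_exact <- str_count(descriptions, "fish")

# regex(..., ignore_case = TRUE) counts every spelling
n_any_case <- str_count(descriptions, regex("fish", ignore_case = TRUE))

print(n_exact)
print(n_any_case)
```

If capitalised mentions should count, the case-insensitive form or lower-casing the text first (for example with str_to_lower()) are two possible options.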

Tokenisation

In text sensing, tokenisation is the process of breaking up a given text into units called tokens. We will discard characters like punctuation marks in this process.

The two basic arguments to unnest_tokens() used here are column names. First argument is the output column name that will be created as the text is unnested into it, and then the input column that the text comes from (product_services, in this case).
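As a minimal illustration of these two arguments, the sketch below tokenises two made-up descriptions. With the default settings, unnest_tokens() lower-cases the text and strips punctuation, and the other columns (here id) are repeated for every token.

```r
library(dplyr)
library(tidytext)

# Made-up descriptions, not MC3 data
toy_nodes <- tibble::tibble(
  id = c("n1", "n2"),
  product_services = c("Fish, Seafood and frozen fish.", "Shipping to ports")
)

# word = output column, product_services = input column
toy_tokens <- toy_nodes %>%
  unnest_tokens(word, product_services)

print(toy_tokens)
```

The result has one row per token, so counting rows per word gives the word frequencies directly.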

  nodesToken <- MC3_nodes_clean %>%
  unnest_tokens (word, product_services)
  #can add in to_lower = TRUE
  # add in strip_punct = TRUE
  nodesToken %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
      #after coord_flip() the x aesthetic (word) is drawn on the vertical axis
      labs(x = "Unique words",
      y = "Count",
      title = "Count of unique words found in product_services field")
Selecting by n

The chart above shows that the most frequent tokens are not informative, for example NA, “a” and “to”. We need to remove such words as stop words.

From the tokens generated, we take out the common, generic words and also exclude NA records.

  tidy_stopwords <- nodesToken %>%
  anti_join(stop_words)%>%
  na.omit()
Joining with `by = join_by(word)`
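The sketch below repeats the stop-word step on a handful of made-up tokens to show what is removed. It uses the stop_words table shipped with tidytext and names the join column explicitly, which also silences the joining message.

```r
library(dplyr)
library(tidytext)

# Made-up tokens, including generic words and a missing value
toy_tokens <- tibble::tibble(
  word = c("fish", "a", "to", "seafood", NA, "and")
)

toy_clean <- toy_tokens %>%
  anti_join(stop_words, by = "word") %>%  # drop words listed in stop_words
  filter(!is.na(word))                    # drop missing tokens

print(toy_clean)
```

Filtering on the word column alone is a narrower alternative to na.omit(), which drops a row when any column is NA, including rows that only lack a revenue value.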

Bar chart after removing stop words

  tidy_stopwords %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
      #after coord_flip() the x aesthetic (word) is drawn on the vertical axis
      labs(x = "Unique words",
      y = "Count",
      title = "Count of unique words found in product_services field")
Selecting by n

    #use parallel coordinates to visualize

  # library(cluster)
  # library(caret)
  # 
  # MC3_nodes <- MC3_nodes %>%
  # select(product_services, country, revenue_omu, type) %>%
  # na.omit()
  # 
  # 
  # 
  # #prepare data
  # clustering_data <- MC3_nodes[, c("product_services", "revenue_omu", "type")]
  # #try K means clustering
  # k <- 4  # Number of clusters
  # set.seed(123)  # For reproducibility
  # kmeans_result <- kmeans(clustering_data, centers = k)
  # 
  # MC3_nodes$cluster <- as.factor(kmeans_result$cluster)
  # cluster_summary <- aggregate(clustering_data, by = list(cluster = MC3_nodes$cluster), FUN = mean)
  # 
  # 
  # 
  # pacman::p_load(GGally, parallelPlot)
  # library(GGally)
  # ggparcoord(MC3_nodes[, c("product_services", "country", "revenue_omu", "type","cluster")], 
  #          columns = 1:3, groupColumn = "cluster", 
  #          title = "Parallel Coordinate Plot: Features by Cluster")
  # 

Plotting relationship?

TODO: failed, needs troubleshooting

  # GraphMC3 <- tbl_graph(nodes = MC3_nodes_clean,
  #                       edges = MC3_edges_clean,
  #                          directed = FALSE)
  # #
  # GraphMC3
  #is_connected <- is.connected(GraphMC2)
  
  # peopleEntityRelationship %>%
  # activate(edges) %>%
  # arrange(desc(weightkg))